Addressing scaling challenges in comparative genomics

Author

  • Natalia Golenetskaya
Abstract

Comparative genomics is essentially a form of data mining in large collections of n-ary relations between genomic elements. Increases in the number of sequenced genomes create a stress on comparative genomics that grows, at worst geometrically, with every increase in sequence data. Even modestly sized labs now routinely obtain several genomes at a time and, like large consortia, expect to be able to perform all-against-all analyses as part of these new multi-genome strategies. To address these needs at all levels, it is necessary to rethink the algorithmic frameworks and data storage technologies used for comparative genomics. To meet these challenges of scale, in this thesis we develop novel methods based on NoSQL and MapReduce technologies. Using a characterization of the kinds of data used in comparative genomics, and a study of usage patterns for their analysis, we define a practical formalism for genomic Big Data, implement it using the Cassandra NoSQL platform, and evaluate its performance. Furthermore, using two quite different global analyses in comparative genomics, we define two strategies for adapting these applications to the MapReduce paradigm and derive new algorithms. For the first, identifying gene fusion and fission events in phylogenies, we reformulate the problem as a bounded parallel traversal that avoids high-latency graph-based algorithms. For the second, consensus clustering to identify protein families, we define an iterative sampling procedure that quickly converges to the desired global result. We implement both new algorithms on the Hadoop MapReduce platform and evaluate their performance. The performance is competitive and scales much better than existing solutions, but requires particular (and future) effort in devising specific algorithms.

tel-00865840, version 1 - 25 Sep 2013
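To make the scaling argument and the MapReduce paradigm concrete, here is a minimal in-memory sketch in Python. It is not the thesis's algorithm and does not use the Hadoop or Cassandra APIs. The function names (`all_against_all_pairs`, `map_phase`, `reduce_phase`, `run_mapreduce`) and the toy similarity hits are assumptions made for illustration. It only shows how the number of genome pairs grows in an all-against-all comparison, and how a set of pairwise relations can be regrouped per gene by a map step, a group-by-key step, and a reduce step.

```python
from collections import defaultdict


def all_against_all_pairs(n_genomes):
    """Number of unordered genome pairs in an all-against-all comparison."""
    return n_genomes * (n_genomes - 1) // 2


def map_phase(hit):
    """Emit one (key, value) pair for each endpoint of a pairwise hit."""
    gene_a, gene_b = hit
    yield gene_a, gene_b
    yield gene_b, gene_a


def reduce_phase(gene, neighbors):
    """Collect the sorted, de-duplicated neighbors of a single gene."""
    return gene, sorted(set(neighbors))


def run_mapreduce(hits):
    """Run map, shuffle (group values by key), and reduce in memory."""
    groups = defaultdict(list)
    for hit in hits:
        for key, value in map_phase(hit):
            groups[key].append(value)
    return dict(reduce_phase(key, values) for key, values in groups.items())


# Hypothetical toy data: pairwise similarity hits between gene identifiers.
hits = [("g1", "g2"), ("g2", "g3"), ("g1", "g2")]
neighbors = run_mapreduce(hits)
```

In a real MapReduce platform the grouping step is carried out by the framework across machines. The sketch keeps everything in one process so that the data flow stays visible.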

Similar resources

Addressing the Omics Data Explosion: a Comprehensive Reference Genome Representation and the Democratization of Comparative Genomics and Immunogenomics

Comparative genomics of human stem cell factor (SCF)

Stem cell factor (SCF) is a critical protein with key roles in the cell such as hematopoiesis, gametogenesis and melanogenesis. In the present study a comparative analysis on nucleotide sequences of SCF was performed in Humanoids using bioinformatics tools including NCBI-BLAST, MEGA6, and JBrowse. Our analysis of nucleotide sequences to find closely evolved organisms with high similarity by NCB...

Review of Techniques for Gene Sequencing, Annotation and Comparative Genomics

The availability and complete sequencing of many organisms has made comparative analysis of gene a new field of research. The explosion in sequenced genome data on daily basis made this task an enormous one. Several techniques and methods have been devised and applied to carry out genome comparison. In this work, we surveyed and presented an overview of common methods, techniques, tools and cha...

Applications of hidden Markov models for comparative gene structure prediction

Identifying the structure in genome sequences is one of the principal challenges in modern molecular biology, and comparative genomics offers a powerful tool. In this paper we introduce a hidden Markov model that allows a comparative analysis of multiple sequences related by a phylogenetic tree. The model integrates structure prediction methods for one sequence, statistical multiple alignment m...

Addressing NCDs: Penetration of the Producers of Hazardous Products into Global Health Environment Requires a Strong Response; Comment on “Addressing NCDs: Challenges From Industry Market Promotion and Interferences”

Timely warnings and examples of industry interference in relation to tobacco, alcohol, food and breast milk substitutes are given in the editorial by Tangcharoensathien et al. Such interference is rife at national levels and also at the global level. In an era of ‘private public partnerships’ the alcohol and food industries have succeeded in insinuating themselves into the global health environ...

Publication date: 2013